Understanding Image and Text Simultaneously: a Dual Vision-Language Machine Comprehension Task
Abstract
We introduce a new multi-modal task for computer systems, posed as a combined vision-language comprehension challenge: identifying the most suitable text describing a scene, given several similar options. Accomplishing the task entails demonstrating comprehension beyond just recognizing “keywords” (or key-phrases) and their corresponding visual concepts. Instead, it requires an alignment between the representations of the two modalities that achieves a visually-grounded “understanding” of various linguistic elements and their dependencies. This new task also admits an easy-to-compute and well-studied metric: the accuracy in detecting the true target among the decoys. The paper makes several contributions: an effective and extensible mechanism for generating decoys from (human-created) image captions; an instance of applying this mechanism, yielding a large-scale machine comprehension dataset (based on the COCO images and captions) that we make publicly available; human evaluation results on this dataset, informing a performance upper-bound; and several baseline and competitive learning approaches that illustrate the utility of the proposed task and dataset in advancing both image and language comprehension. We also show that, in a multi-task learning setting, the performance on the proposed task is positively correlated with the end-to-end task of image captioning.
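The metric named in the abstract, accuracy in picking the true caption among the decoys, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `choose_caption`, `accuracy`, `toy_score`, `TOY_TAGS`, and `EXAMPLES` are hypothetical, the data is invented, and the word-overlap scorer is only a placeholder for a real image-text model (it is exactly the kind of keyword matching the task is designed to go beyond).

```python
from typing import Callable, List, Sequence, Tuple

# One example: (image id, candidate captions, index of the true caption).
Example = Tuple[str, Sequence[str], int]
Scorer = Callable[[str, str], float]


def choose_caption(score: Scorer, image_id: str, candidates: Sequence[str]) -> int:
    """Return the index of the candidate with the highest score (first one on ties)."""
    return max(range(len(candidates)), key=lambda i: score(image_id, candidates[i]))


def accuracy(score: Scorer, examples: Sequence[Example]) -> float:
    """Fraction of examples where the top-scoring candidate is the true caption."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        choose_caption(score, image_id, candidates) == target
        for image_id, candidates, target in examples
    )
    return correct / len(examples)


# Hypothetical stand-in for visual content: a set of tags per image.
TOY_TAGS = {
    "img1": {"dog", "frisbee", "grass"},
    "img2": {"man", "bicycle", "street"},
}


def toy_score(image_id: str, caption: str) -> float:
    """Placeholder scorer: number of image tags that appear in the caption."""
    return float(len(TOY_TAGS[image_id] & set(caption.lower().split())))


# Invented examples: one true caption plus decoys for each image.
EXAMPLES: List[Example] = [
    ("img1", ["a cat sleeps on a sofa",
              "a dog catches a frisbee on the grass",
              "a dog sits near a car"], 1),
    ("img2", ["a man rides a bicycle down the street",
              "a woman rides a horse on the beach"], 0),
]

if __name__ == "__main__":
    print(f"accuracy = {accuracy(toy_score, EXAMPLES):.2f}")
```

Because the score function is passed in as an argument, the same evaluation loop could be reused with any model that assigns a score to an (image, caption) pair.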
Journal: CoRR
Volume: abs/1612.07833
Publication date: 2016